Data Analysis Project

Marketing analysis



I - General introduction


Origin

The present dataframe was created for marketing analysis purposes. It assembles various personal information about 2240 customers, such as their education level, income, age, marital status, and number of children at home.

It also records their consumption habits (amount spent on wine, on sweets, etc.) and the number of purchases made on discounted products.

There is very little context for this dataframe, since its source is unknown. It is not clear when the data were recorded, but it was probably around 2014, since no customer enrollment date is later than 2014.

Aims

To predict customer behavior (the number of purchases made with a discount) from the most significant personal attributes.
To categorize customers into a few typical profiles (probably with PCA).

Attributes

  • People

ID: Customer’s unique identifier
Year_Birth: Customer’s birth year
Education: Customer’s education level
Marital_Status: Customer’s marital status
Income: Customer’s yearly household income
Kidhome: Number of children in customer’s household
Teenhome: Number of teenagers in customer’s household
Dt_Customer: Date of customer’s enrollment with the company
Recency: Number of days since customer’s last purchase
Complain: 1 if customer complained in the last 2 years, 0 otherwise

  • Products

MntWines: Amount spent on wine in last 2 years
MntFruits: Amount spent on fruits in last 2 years
MntMeatProducts: Amount spent on meat in last 2 years
MntFishProducts: Amount spent on fish in last 2 years
MntSweetProducts: Amount spent on sweets in last 2 years
MntGoldProds: Amount spent on gold in last 2 years

  • Promotions

NumDealsPurchases: Number of purchases made with a discount
AcceptedCmp1: 1 if customer accepted the offer in the 1st campaign, 0 otherwise
AcceptedCmp2: 1 if customer accepted the offer in the 2nd campaign, 0 otherwise
AcceptedCmp3: 1 if customer accepted the offer in the 3rd campaign, 0 otherwise
AcceptedCmp4: 1 if customer accepted the offer in the 4th campaign, 0 otherwise
AcceptedCmp5: 1 if customer accepted the offer in the 5th campaign, 0 otherwise
Response: 1 if customer accepted the offer in the last campaign, 0 otherwise
NumStorePurchases: Number of purchases made directly in stores

Before loading the dataset, we make sure all the necessary packages are installed and loaded, so that the code can be run by anybody.

if(!require(pacman)) {
  install.packages("pacman")
  library(pacman)
}
suppressPackageStartupMessages(pacman::p_load(tidyverse, gtsummary, ggpubr, moments, here, sjPlot, parameters, effectsize, pander, psych))

data <- read.table(here("JULIETTE", "marketing_campaign.csv"), header = TRUE, sep = "\t")

II - Data overview and cleaning


Structure

str(data)
## 'data.frame':    2240 obs. of  29 variables:
##  $ ID                 : int  5524 2174 4141 6182 5324 7446 965 6177 4855 5899 ...
##  $ Year_Birth         : int  1957 1954 1965 1984 1981 1967 1971 1985 1974 1950 ...
##  $ Education          : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
##  $ Marital_Status     : chr  "Single" "Single" "Together" "Together" ...
##  $ Income             : int  58138 46344 71613 26646 58293 62513 55635 33454 30351 5648 ...
##  $ Kidhome            : int  0 1 0 1 1 0 0 1 1 1 ...
##  $ Teenhome           : int  0 1 0 0 0 1 1 0 0 1 ...
##  $ Dt_Customer        : chr  "04-09-2012" "08-03-2014" "21-08-2013" "10-02-2014" ...
##  $ Recency            : int  58 38 26 26 94 16 34 32 19 68 ...
##  $ MntWines           : int  635 11 426 11 173 520 235 76 14 28 ...
##  $ MntFruits          : int  88 1 49 4 43 42 65 10 0 0 ...
##  $ MntMeatProducts    : int  546 6 127 20 118 98 164 56 24 6 ...
##  $ MntFishProducts    : int  172 2 111 10 46 0 50 3 3 1 ...
##  $ MntSweetProducts   : int  88 1 21 3 27 42 49 1 3 1 ...
##  $ MntGoldProds       : int  88 6 42 5 15 14 27 23 2 13 ...
##  $ NumDealsPurchases  : int  3 2 1 2 5 2 4 2 1 1 ...
##  $ NumWebPurchases    : int  8 1 8 2 5 6 7 4 3 1 ...
##  $ NumCatalogPurchases: int  10 1 2 0 3 4 3 0 0 0 ...
##  $ NumStorePurchases  : int  4 2 10 4 6 10 7 4 2 0 ...
##  $ NumWebVisitsMonth  : int  7 5 4 6 5 6 6 8 9 20 ...
##  $ AcceptedCmp3       : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ AcceptedCmp4       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp5       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp1       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AcceptedCmp2       : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Z_CostContact      : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ Z_Revenue          : int  11 11 11 11 11 11 11 11 11 11 ...
##  $ Response           : int  1 0 0 0 0 0 0 0 1 0 ...

Summary

pander(summary(data))
Table continues below
ID Year_Birth Education Marital_Status
Min. : 0 Min. :1893 Length:2240 Length:2240
1st Qu.: 2828 1st Qu.:1959 Class :character Class :character
Median : 5458 Median :1970 Mode :character Mode :character
Mean : 5592 Mean :1969 NA NA
3rd Qu.: 8428 3rd Qu.:1977 NA NA
Max. :11191 Max. :1996 NA NA
NA NA NA NA
Table continues below
Income Kidhome Teenhome Dt_Customer
Min. : 1730 Min. :0.0000 Min. :0.0000 Length:2240
1st Qu.: 35303 1st Qu.:0.0000 1st Qu.:0.0000 Class :character
Median : 51382 Median :0.0000 Median :0.0000 Mode :character
Mean : 52247 Mean :0.4442 Mean :0.5062 NA
3rd Qu.: 68522 3rd Qu.:1.0000 3rd Qu.:1.0000 NA
Max. :666666 Max. :2.0000 Max. :2.0000 NA
NA’s :24 NA NA NA
Table continues below
Recency MntWines MntFruits MntMeatProducts
Min. : 0.00 Min. : 0.00 Min. : 0.0 Min. : 0.0
1st Qu.:24.00 1st Qu.: 23.75 1st Qu.: 1.0 1st Qu.: 16.0
Median :49.00 Median : 173.50 Median : 8.0 Median : 67.0
Mean :49.11 Mean : 303.94 Mean : 26.3 Mean : 166.9
3rd Qu.:74.00 3rd Qu.: 504.25 3rd Qu.: 33.0 3rd Qu.: 232.0
Max. :99.00 Max. :1493.00 Max. :199.0 Max. :1725.0
NA NA NA NA
Table continues below
MntFishProducts MntSweetProducts MntGoldProds NumDealsPurchases
Min. : 0.00 Min. : 0.00 Min. : 0.00 Min. : 0.000
1st Qu.: 3.00 1st Qu.: 1.00 1st Qu.: 9.00 1st Qu.: 1.000
Median : 12.00 Median : 8.00 Median : 24.00 Median : 2.000
Mean : 37.53 Mean : 27.06 Mean : 44.02 Mean : 2.325
3rd Qu.: 50.00 3rd Qu.: 33.00 3rd Qu.: 56.00 3rd Qu.: 3.000
Max. :259.00 Max. :263.00 Max. :362.00 Max. :15.000
NA NA NA NA
Table continues below
NumWebPurchases NumCatalogPurchases NumStorePurchases NumWebVisitsMonth
Min. : 0.000 Min. : 0.000 Min. : 0.00 Min. : 0.000
1st Qu.: 2.000 1st Qu.: 0.000 1st Qu.: 3.00 1st Qu.: 3.000
Median : 4.000 Median : 2.000 Median : 5.00 Median : 6.000
Mean : 4.085 Mean : 2.662 Mean : 5.79 Mean : 5.317
3rd Qu.: 6.000 3rd Qu.: 4.000 3rd Qu.: 8.00 3rd Qu.: 7.000
Max. :27.000 Max. :28.000 Max. :13.00 Max. :20.000
NA NA NA NA
Table continues below
AcceptedCmp3 AcceptedCmp4 AcceptedCmp5 AcceptedCmp1
Min. :0.00000 Min. :0.00000 Min. :0.00000 Min. :0.00000
1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000 1st Qu.:0.00000
Median :0.00000 Median :0.00000 Median :0.00000 Median :0.00000
Mean :0.07277 Mean :0.07455 Mean :0.07277 Mean :0.06429
3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000 3rd Qu.:0.00000
Max. :1.00000 Max. :1.00000 Max. :1.00000 Max. :1.00000
NA NA NA NA
AcceptedCmp2 Complain Z_CostContact Z_Revenue Response
Min. :0.00000 Min. :0.000000 Min. :3 Min. :11 Min. :0.0000
1st Qu.:0.00000 1st Qu.:0.000000 1st Qu.:3 1st Qu.:11 1st Qu.:0.0000
Median :0.00000 Median :0.000000 Median :3 Median :11 Median :0.0000
Mean :0.01339 Mean :0.009375 Mean :3 Mean :11 Mean :0.1491
3rd Qu.:0.00000 3rd Qu.:0.000000 3rd Qu.:3 3rd Qu.:11 3rd Qu.:0.0000
Max. :1.00000 Max. :1.000000 Max. :3 Max. :11 Max. :1.0000
NA NA NA NA NA

Densities

data %>% 
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~key, scales = "free") + 
  geom_histogram()

We are only interested in the total number of promotions accepted by the customers, since we don’t have details about the nature of each promotion.

data$AcceptedCmpTotal <- data$AcceptedCmp1 + data$AcceptedCmp2 + data$AcceptedCmp3 + data$AcceptedCmp4 + data$AcceptedCmp5 + data$Response 
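The same total can be computed more compactly with rowSums, a sketch equivalent to the explicit sum above (assuming the data frame loaded earlier):

```r
# Sum the six campaign indicator columns row-wise
cmp_cols <- c(paste0("AcceptedCmp", 1:5), "Response")
data$AcceptedCmpTotal <- rowSums(data[, cmp_cols])
```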

The dataframe contains many variables, some of which are superfluous for our analysis (web visits and purchases, complaints, catalog purchases, and Z_Revenue and Z_CostContact, about which we have no information).

data$Complain <- data$NumWebVisitsMonth <- data$NumWebPurchases <- data$NumCatalogPurchases <- data$Z_Revenue <- data$Z_CostContact <- data$AcceptedCmp1 <- data$AcceptedCmp2 <- data$AcceptedCmp3 <- data$AcceptedCmp4 <- data$AcceptedCmp5 <- data$Response <- NULL
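Since the tidyverse is already loaded, the same columns can be dropped with a single dplyr select call. A sketch equivalent to the chained assignment above; note that num_range() targets only AcceptedCmp1 to AcceptedCmp5, so the freshly created AcceptedCmpTotal is kept:

```r
# Drop the superfluous columns in one step
data <- data %>%
  dplyr::select(-Complain, -NumWebVisitsMonth, -NumWebPurchases,
                -NumCatalogPurchases, -Z_Revenue, -Z_CostContact,
                -num_range("AcceptedCmp", 1:5), -Response)
```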

We want to calculate the customers’ ages. Computing “2021 - data$Year_Birth” would give their current age, but it makes more sense to use their age at the moment the data was recorded, so we compute 2014 - data$Year_Birth, keeping in mind that the 2014 recording date is only an assumption.

data$age <- 2014 - data$Year_Birth 
plot(data$age)

We see 3 outliers who seem to be older than 110. The corresponding birth years are 1893, 1900 and 1899. The first could be corrected to 1993; the second would require two typing errors, which is improbable; and the third could be replaced by 1999, but it corresponds to someone with a PhD education level, which is unlikely at age 15. Since the dataset is large, we choose to delete these rows.

# which(data$age>110)
data <- data[-c(193, 240, 340),]
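Deleting rows by hard-coded index is fragile if the file ever changes; an equivalent, index-free sketch filters on the computed age instead:

```r
# Keep only customers with a plausible age (removes the same three rows)
data <- subset(data, age <= 110)
```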

Marital status can be simplified into only a few levels and transformed into a factor. Since the “Other” category represents less than 1% of the participants, there is not enough information to model it as a separate level.

We then transform some other relevant variables into factors.

data$Marital_Status <-  factor(data$Marital_Status, labels = c("Other", "Single", "Single", "Married", "Single", "Together", "Single", "Other"))
data$Marital_Status[data$Marital_Status=="Other"] <- NA; data$Marital_Status = droplevels(data$Marital_Status)
data$ID <- factor(data$ID) 
data$Education <- factor(data$Education) 
data$Teenhome <-  factor(data$Teenhome)
data$AcceptedCmpTotal <-  factor(data$AcceptedCmpTotal)
data$Kidhome <-  factor(data$Kidhome, labels = c("no", "yes", "yes"))

For Kidhome, we merged the answers “1” and “2” because only 2% of customers answered “2”, which is not enough information to model it as a separate level.

We now want to plot all the variables again, and check again whether anything is abnormal.

data %>% 
  keep(is.numeric) %>%
  gather() %>%
  ggplot(aes(value)) +
  facet_wrap(~key, scales = "free") + 
  geom_histogram()

III - Hypotheses

Here we eyeball the relationships between various variables and determine those that seem most relevant for further analysis.

data %>% 
      filter(!is.na(Marital_Status)) %>% 
      ggplot(aes(AcceptedCmpTotal, Income)) +  geom_violin(trim=FALSE, fill='#A4A4A4', color  ="darkred") 

ggplot(data, aes(Education, MntWines))+ geom_boxplot(outlier.colour =  "red") +  geom_point(position = position_jitter(), color="coral4") 

ggplot(data, aes(Kidhome, MntSweetProducts))+ geom_boxplot(outlier.colour =  "red") +  geom_point(position = position_jitter()) 

ggplot(data, aes(x=Income, y=NumStorePurchases)) +
  geom_violin(trim=FALSE, fill='#A4A4A4', color="darkred")+
  theme_minimal()
# geom_boxplot(width=0.05, outlier.colour =  "red") + theme_minimal()


The number of store purchases seems to be the most useful response variable to analyse, since a marketing analysis might want to determine which profiles buy the most in store. In order to use NumStorePurchases as the response variable of a linear model, we first have to check normality.

In this case, the data is intrinsically skewed because negative values are impossible. Skewness is quite high (0.7) but remains tolerable given the nature of the data. This has to be kept in mind while performing the model assumption checks.

ggplot(data, aes(x = NumStorePurchases)) +
  geom_histogram(aes(y = ..density..),
                 colour = 1, fill = "white") +
  geom_density(adjust = 1) + labs(title = "Purchases",
 caption = paste("skewness =", round(moments::skewness(data$NumStorePurchases, na.rm = TRUE),2)))

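If the skewness were judged too high, a square-root transform (a standard variance-stabilizing choice for non-negative counts, which an earlier draft kept commented out) could be compared. A sketch, assuming the data frame from above:

```r
# Skewness of the square-root-transformed counts, for comparison with 0.7
moments::skewness(sqrt(data$NumStorePurchases), na.rm = TRUE)
```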

IV - Modelling


A. Linear model

A.1. Model creation and assumptions


m1 <- lm(data=data, NumStorePurchases ~ Kidhome*Income*Education*age) 
plot(m1, c(1:2,4), ask=F)


Looking at the QQ plot, we don’t see any specific pattern, which means the residuals are distributed roughly normally, although the lowest theoretical quantiles show somewhat lighter tails.

For the Cook’s distance, we clearly observe an outlier (observation 2234), while the vast majority of the values are under 1.

The residuals-vs-fitted comparison leads us to inspect the relationship between income and store purchases more closely:

ggplot(data, aes(NumStorePurchases, Income)) + geom_point(color="blue", alpha=0.3, position = position_jitter())

Here we observe peculiar outliers for “Income”. One income is equal to 666,666, and for income values higher than 150,000 the customers did not respond. We choose to remove these customers because they probably did not report their true income.

newdf = data %>%
  filter(!ID %in% c(9432, 5555, 4619, 5336, 1501, 1503, 8475, 4931, 11181) ) 

A.2. Reassessing without the outliers


m1 <- lm(data=newdf, NumStorePurchases ~ Kidhome*Income*Education*age)
plot(m1, c(1:2,4), ask=F)


Looking at the QQ plot, the extreme points now look more normal. The Cook’s distance plot also no longer shows values higher than 0.5.

A.3. Model selection

Here, we use the stepAIC function to select which terms to keep, by choosing the model with the best AIC.

ms <- MASS::stepAIC(m1, direction = "both", trace = FALSE) 
ms$anova
## Stepwise Model Path 
## Analysis of Deviance Table
## 
## Initial Model:
## NumStorePurchases ~ Kidhome * Income * Education * age
## 
## Final Model:
## NumStorePurchases ~ Kidhome + Income + Education + age + Kidhome:Income + 
##     Kidhome:Education + Income:Education + Kidhome:age + Income:age + 
##     Education:age + Kidhome:Income:Education + Income:Education:age
## 
## 
##                             Step Df  Deviance Resid. Df Resid. Dev      AIC
## 1                                                  2164   11138.08 3650.691
## 2 - Kidhome:Income:Education:age  4 26.205697      2168   11164.28 3647.870
## 3        - Kidhome:Education:age  4 14.097325      2172   11178.38 3642.652
## 4           - Kidhome:Income:age  1  8.240739      2173   11186.62 3642.276

A.4. Computing the final model

The final model chosen by the stepwise selection is:

finalm1 <- lm(data=newdf,NumStorePurchases ~ Kidhome + Income + Education + age + Kidhome:Income + Kidhome:Education + Income:Education + Kidhome:age + Income:age + Education:age + Kidhome:Income:Education + Income:Education:age)

A.5. Statistical inference

We first use the eta_squared function to compute the effect sizes: every term whose 90% CI lower bound is 0.00 has a negligible (“meaningless”) effect size, but we still keep them in the model.

parameters::model_parameters(anova(finalm1))
## Parameter                | Sum_Squares |   df | Mean_Square |       F |      p
## ------------------------------------------------------------------------------
## Kidhome                  |     6242.08 |    1 |     6242.08 | 1212.52 | < .001
## Income                   |     5446.16 |    1 |     5446.16 | 1057.92 | < .001
## Education                |       10.29 |    4 |        2.57 |    0.50 | 0.736 
## age                      |       24.65 |    1 |       24.65 |    4.79 | 0.029 
## Kidhome:Income           |        4.87 |    1 |        4.87 |    0.95 | 0.331 
## Kidhome:Education        |       31.63 |    4 |        7.91 |    1.54 | 0.189 
## Income:Education         |       26.11 |    4 |        6.53 |    1.27 | 0.280 
## Kidhome:age              |       36.69 |    1 |       36.69 |    7.13 | 0.008 
## Income:age               |        5.97 |    1 |        5.97 |    1.16 | 0.282 
## Education:age            |       11.41 |    4 |        2.85 |    0.55 | 0.696 
## Kidhome:Income:Education |       52.82 |    4 |       13.20 |    2.56 | 0.037 
## Income:Education:age     |       49.16 |    4 |       12.29 |    2.39 | 0.049 
## Residuals                |    11186.62 | 2173 |        5.15 |         |       
## 
## Anova Table (Type 1 tests)
effectsize::eta_squared(car::Anova(finalm1, type = 2), ci = 0.9, alternative = "two") 
## # Effect Size for ANOVA (Type II)
## 
## Parameter                | Eta2 (partial) |       90% CI
## --------------------------------------------------------
## Kidhome                  |           0.05 | [0.04, 0.07]
## Income                   |           0.31 | [0.29, 0.34]
## Education                |       4.56e-04 | [0.00, 0.00]
## age                      |       2.42e-03 | [0.00, 0.01]
## Kidhome:Income           |       1.20e-03 | [0.00, 0.00]
## Kidhome:Education        |       2.86e-03 | [0.00, 0.01]
## Income:Education         |       2.28e-03 | [0.00, 0.01]
## Kidhome:age              |       3.98e-03 | [0.00, 0.01]
## Income:age               |       7.07e-04 | [0.00, 0.00]
## Education:age            |       1.07e-03 | [0.00, 0.00]
## Kidhome:Income:Education |       5.30e-03 | [0.00, 0.01]
## Income:Education:age     |       4.38e-03 | [0.00, 0.01]

A.6. Plots and summary estimates

Plot estimates

We call sjPlot’s plot_model function to plot all the estimates at once.

sjPlot::plot_model(finalm1) 

Plot marginal effects

We can also plot the predicted marginal effects of one term (or interaction) at a time.

sjPlot::plot_model(finalm1, type = "pred", terms = "Kidhome", show.data  = T, jitter = 1) 
sjPlot::plot_model(finalm1, type = "pred", terms = "Income", show.data  = T, jitter = 1) 
sjPlot::plot_model(finalm1, type = "pred", terms = c("Income", "Kidhome"), show.data  = T, jitter = 1) 
sjPlot::plot_model(finalm1, type = "pred", terms = c("Income", "Education"), show.data  = T, jitter = 1) 
sjPlot::plot_model(finalm1, type = "pred", terms = c("Income", "age"), show.data  = T, jitter = 1) 
sjPlot::plot_model(finalm1, type = "pred", terms = c("age", "Kidhome"), show.data  = T, jitter = 1) 
sjPlot::plot_model(finalm1, type = "pred", terms = c("age", "Income", "Education"), show.data  = T, jitter = 1) 


As one might expect, the number of store purchases increases with income. A more counter-intuitive finding is that it is lower for customers who have kids; this could be because the dataframe does not account for the store purchases of the other parent.

Looking at the interaction between education and income, we see that, at the same income, customers with basic education are clearly separated from the others by purchasing less.

For the interaction between income and age, an interesting pattern appears: for the lowest incomes, the number of store purchases decreases with age, while the tendency is reversed for the highest incomes.

We also observe an interaction between having kids at home and age: the number of store purchases stays stable for customers without kids at home, while it decreases for those with kids at home, possibly because kids leave the house as the parents get older.

Finally, the rate at which store purchases increase with income stays stable across ages for every education level except basic education.

Summary table of standardized estimates

We want a table that summarizes all the standardized estimates in numeric form.

sjPlot::tab_model(effectsize::standardize(finalm1), rm.terms = c("*Education.Q", "Education^4", "Income:Education.C", "Education.Q" , "Kidhomeyes:Education.Q", "Income:Education.Q", "Kidhomeyes:Income:Education.Q", "Education.C" ,              "Kidhomeyes:Education.C"  ,      "Income:Education.C"  ,"Kidhomeyes:Income:Education.C", "Education^4",  "Kidhomeyes:Education^4", "Income:Education^4" , "Kidhomeyes:Income:Education^4"), show.intercept = F)
  NumStorePurchases
Predictors Estimates CI p
Kidhome [yes] -0.58 -0.83 – -0.32 <0.001
Income 0.61 0.48 – 0.74 <0.001
Education [Basic] -0.62 -2.03 – 0.79 0.391
Education [Graduation] -0.07 -0.23 – 0.08 0.356
Education [Master] -0.11 -0.29 – 0.08 0.254
Education [PhD] 0.04 -0.14 – 0.22 0.663
age 0.03 -0.09 – 0.14 0.668
Kidhome [yes] * Income -0.11 -0.35 – 0.13 0.383
Kidhome [yes] * Education [Basic] -0.76 -3.00 – 1.47 0.503
Kidhome [yes] * Education [Graduation] 0.17 -0.10 – 0.45 0.214
Kidhome [yes] * Education [Master] 0.30 -0.01 – 0.61 0.061
Kidhome [yes] * Education [PhD] 0.09 -0.22 – 0.40 0.555
Income * Education [Basic] -0.32 -1.28 – 0.65 0.522
Income * Education [Graduation] -0.03 -0.18 – 0.12 0.684
Income * Education [Master] 0.01 -0.16 – 0.18 0.894
Income * Education [PhD] -0.17 -0.34 – -0.00 0.048
Kidhome [yes] * age -0.12 -0.20 – -0.04 0.003
Income * age 0.03 -0.07 – 0.13 0.570
Education [Basic] * age -0.66 -1.79 – 0.47 0.253
Education [Graduation] * age -0.03 -0.15 – 0.09 0.644
Education [Master] * age 0.02 -0.12 – 0.16 0.761
Education [PhD] * age -0.06 -0.20 – 0.07 0.342
(Kidhome [yes] * Income) * Education [Basic] -0.63 -2.07 – 0.81 0.392
(Kidhome [yes] * Income) * Education [Graduation] 0.12 -0.14 – 0.38 0.374
(Kidhome [yes] * Income) * Education [Master] 0.23 -0.08 – 0.54 0.153
(Kidhome [yes] * Income) * Education [PhD] 0.44 0.13 – 0.76 0.005
(Income * Education [Basic]) * age -0.47 -1.19 – 0.26 0.209
(Income * Education [Graduation]) * age -0.09 -0.20 – 0.02 0.113
(Income * Education [Master]) * age -0.04 -0.16 – 0.09 0.574
(Income * Education [PhD]) * age 0.02 -0.10 – 0.15 0.716
Observations 2204
R2 / R2 adjusted 0.516 / 0.510
We see that R^2 and adjusted R^2 have acceptable values.

B. Principal component analysis

clean_data <-data[rowSums(is.na(data))==0, ]
pm1<-prcomp(clean_data[,-c(1,3,4,6,7,8,18)],  scale=TRUE)
summary(pm1)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.0965 1.4242 1.04714 1.00079 0.90502 0.80789 0.70029
## Proportion of Variance 0.3663 0.1690 0.09138 0.08346 0.06826 0.05439 0.04087
## Cumulative Proportion  0.3663 0.5353 0.62667 0.71013 0.77839 0.83278 0.87364
##                            PC8     PC9    PC10    PC11      PC12
## Standard deviation     0.65473 0.64171 0.61837 0.54168 6.551e-15
## Proportion of Variance 0.03572 0.03432 0.03187 0.02445 0.000e+00
## Cumulative Proportion  0.90937 0.94368 0.97555 1.00000 1.000e+00


Looking at the cumulative proportion, we need 8 principal components to reach 0.90 of the variance explained.

Fitting a linear model m2 of NumStorePurchases on these 8 components below, we see that PC6 is non-significant. Removing PC6 yields a model m3 that, by an F-test, is not significantly different from m2, so we keep the simpler model m3, in which all remaining components are significant.
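The component count for a given variance threshold can also be computed directly from the prcomp object, rather than read off the summary. A sketch, assuming pm1 from above:

```r
# Cumulative proportion of variance explained, and the first component
# reaching the 0.90 threshold (component 8 for this dataset)
cum_var <- cumsum(pm1$sdev^2) / sum(pm1$sdev^2)
which(cum_var >= 0.90)[1]
```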

data2 <- cbind(clean_data, pm1$x)
m2 <- lm(data=data2, NumStorePurchases ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8)
summary(m2)
## 
## Call:
## lm(formula = NumStorePurchases ~ PC1 + PC2 + PC3 + PC4 + PC5 + 
##     PC6 + PC7 + PC8, data = data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2939 -0.3155 -0.0040  0.3936  5.8489 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  5.80444    0.02159  268.817  < 2e-16 ***
## PC1         -1.16939    0.01030 -113.515  < 2e-16 ***
## PC2          0.12699    0.01517    8.374  < 2e-16 ***
## PC3         -0.86347    0.02063  -41.865  < 2e-16 ***
## PC4         -0.12584    0.02158   -5.831 6.32e-09 ***
## PC5         -0.69782    0.02386  -29.242  < 2e-16 ***
## PC6         -0.02038    0.02673   -0.762  0.44602    
## PC7         -2.15573    0.03084  -69.899  < 2e-16 ***
## PC8         -0.10767    0.03299   -3.264  0.00112 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.015 on 2200 degrees of freedom
## Multiple R-squared:  0.9031, Adjusted R-squared:  0.9027 
## F-statistic:  2562 on 8 and 2200 DF,  p-value: < 2.2e-16
m3 <- update(m2, . ~ . - PC6)
anova(m3, m2)
## Analysis of Variance Table
## 
## Model 1: NumStorePurchases ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC7 + PC8
## Model 2: NumStorePurchases ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + 
##     PC8
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1   2201 2266.4                           
## 2   2200 2265.8  1   0.59833 0.5809  0.446
summary(m3)
## 
## Call:
## lm(formula = NumStorePurchases ~ PC1 + PC2 + PC3 + PC4 + PC5 + 
##     PC7 + PC8, data = data2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.2783 -0.3167 -0.0046  0.3917  5.8675 
## 
## Coefficients:
##             Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)  5.80444    0.02159  268.843  < 2e-16 ***
## PC1         -1.16939    0.01030 -113.525  < 2e-16 ***
## PC2          0.12699    0.01516    8.374  < 2e-16 ***
## PC3         -0.86347    0.02062  -41.869  < 2e-16 ***
## PC4         -0.12584    0.02158   -5.832  6.3e-09 ***
## PC5         -0.69782    0.02386  -29.245  < 2e-16 ***
## PC7         -2.15573    0.03084  -69.906  < 2e-16 ***
## PC8         -0.10767    0.03298   -3.264  0.00111 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.015 on 2201 degrees of freedom
## Multiple R-squared:  0.903,  Adjusted R-squared:  0.9027 
## F-statistic:  2928 on 7 and 2201 DF,  p-value: < 2.2e-16
sjPlot::tab_pca(pm1, show.cronb = F, show.var = T, nmbr.fctr = 8)
Principal Component Analysis
  Component 1 Component 2 Component 3 Component 4 Component 5 Component 6 Component 7 Component 8
Year_Birth 0.00 -1.00 0.02 -0.01 0.05 -0.02 0.06 -0.00
Income -0.14 0.12 0.05 -0.02 -0.81 0.10 -0.29 0.20
Recency 0.00 0.01 -0.00 1.00 -0.00 0.01 -0.00 0.01
MntWines -0.15 0.09 0.00 0.02 -0.52 0.18 -0.70 0.01
MntFruits -0.86 -0.01 0.06 -0.01 -0.15 0.11 -0.23 0.12
MntMeatProducts -0.52 -0.02 0.04 0.03 -0.71 0.09 -0.17 0.10
MntFishProducts -0.71 0.02 0.08 -0.01 -0.26 0.22 -0.13 0.31
MntSweetProducts -0.40 -0.01 0.05 0.02 -0.24 0.12 -0.19 0.83
MntGoldProds -0.23 0.03 -0.04 0.01 -0.15 0.93 -0.18 0.10
NumDealsPurchases 0.09 0.04 -0.99 0.00 0.05 0.04 -0.05 -0.04
NumStorePurchases -0.28 0.08 -0.07 -0.01 -0.19 0.12 -0.85 0.20
age -0.00 1.00 -0.02 0.01 -0.05 0.02 -0.06 0.00
Proportion of Variance 36.63 % 16.90 % 9.14 % 8.35 % 6.83 % 5.44 % 4.09 % 3.57 %
Cumulative Proportion 36.63 % 53.53 % 62.67 % 71.01 % 77.84 % 83.28 % 87.36 % 90.94 %
varimax-rotation

Conclusions

In the linear model, income and having kids at home are by far the strongest predictors of the number of store purchases, while education and age contribute mostly through small interaction effects. Alternatively, a linear model on 7 principal components explains about 90% of the variance in store purchases (adjusted R² = 0.90).